The analysis here is on dataset of hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis, MN and St Paul, MN. This dataset was taken from https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume . This analysis will help in better urban planning, maintenance planning, lowering congestion level and increasing driver safety by forecasting the traffic volume for a future timeframe.
Column Names: Datatype: Meanings
The available data is hourly from 2nd Oct 2012 to 30th Sep 2018. So, we have hourly data for 6 years. While we will analyse old years data for traffic patern, we will mostly use last 1-2 years data (2016 Oct to 2018 Sep) for our forecast.
## holiday temp rain_1h snow_1h
## Length:48204 Min. : 0.0 Min. : 0.000 Min. :0.0000000
## Class :character 1st Qu.:272.2 1st Qu.: 0.000 1st Qu.:0.0000000
## Mode :character Median :282.4 Median : 0.000 Median :0.0000000
## Mean :281.2 Mean : 0.334 Mean :0.0002224
## 3rd Qu.:291.8 3rd Qu.: 0.000 3rd Qu.:0.0000000
## Max. :310.1 Max. :9831.300 Max. :0.5100000
## clouds_all weather_main weather_description date_time
## Min. : 0.00 Length:48204 Length:48204 Length:48204
## 1st Qu.: 1.00 Class :character Class :character Class :character
## Median : 64.00 Mode :character Mode :character Mode :character
## Mean : 49.36
## 3rd Qu.: 90.00
## Max. :100.00
## traffic_volume record_date date_time.1
## Min. : 0 Min. :2012-10-02 Min. :2012-10-02 09:00:00
## 1st Qu.:1193 1st Qu.:2014-02-06 1st Qu.:2014-02-06 11:45:00
## Median :3380 Median :2016-06-11 Median :2016-06-11 03:30:00
## Mean :3260 Mean :2016-01-04 Mean :2016-01-05 10:06:31
## 3rd Qu.:4933 3rd Qu.:2017-08-11 3rd Qu.:2017-08-11 06:00:00
## Max. :7280 Max. :2018-09-30 Max. :2018-09-30 23:00:00
## week_name month_name
## Length:48204 Length:48204
## Class :character Class :character
## Mode :character Mode :character
##
##
##
From the above plot, we can see that there is no missing value in existing records. But since data volume is huge, let’s analyse if obervation for any date or month is missing.
## [1] "Number of observation for Oct, 2012 Oct to Sep, 2013 Sep is : 9219"
## [1] "Number of observation for Oct, 2013 Oct to Sep, 2014 Sep is : 6752"
## [1] "Number of observation for Oct, 2014 Oct to Sep, 2015 Sep is : 2731"
## [1] "Number of observation for Oct, 2015 Oct to Sep, 2016 Sep is : 8307"
## [1] "Number of observation for Oct, 2016 Oct to Sep, 2017 Sep is : 10593"
## [1] "Number of observation for Oct, 2017 Oct to Sep, 2018 Sep is : 10602"
We know that for an year there can be 8760 records (or 8784 for 2016 being a leap year). By looking at data for 2014, 2015, it seems a lot of observations are missing. 2013 being very old compared to latest data, we decided to ignore this year’s data as well.
So, we will go ahead with data for 2016, 2017 and 2018 to analyse our use case.
## [1] "Summary of data with non zero snow variable is below:"
## holiday temp rain_1h snow_1h
## Length:63 Min. :266.7 Min. :0.00000 Min. :0.0500
## Class :character 1st Qu.:268.9 1st Qu.:0.00000 1st Qu.:0.0600
## Mode :character Median :273.2 Median :0.00000 Median :0.1000
## Mean :271.3 Mean :0.06222 Mean :0.1702
## 3rd Qu.:273.7 3rd Qu.:0.00000 3rd Qu.:0.2500
## Max. :274.3 Max. :0.98000 Max. :0.5100
## clouds_all weather_main weather_description date_time
## Min. :75.00 Length:63 Length:63 Length:63
## 1st Qu.:90.00 Class :character Class :character Class :character
## Median :90.00 Mode :character Mode :character Mode :character
## Mean :88.57
## 3rd Qu.:90.00
## Max. :90.00
## traffic_volume record_date date_time.1
## Min. : 359 Min. :2015-12-23 Min. :2015-12-23 12:00:00
## 1st Qu.: 888 1st Qu.:2015-12-28 1st Qu.:2015-12-28 21:00:00
## Median :2979 Median :2015-12-29 Median :2015-12-29 07:00:00
## Mean :2939 Mean :2015-12-31 Mean :2015-12-31 14:47:37
## 3rd Qu.:4933 3rd Qu.:2016-01-07 3rd Qu.:2016-01-07 16:00:00
## Max. :6438 Max. :2016-01-08 Max. :2016-01-08 15:00:00
## week_name month_name
## Length:63 Length:63
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Only 63 records has snow value greater than 0. This looks like data collection error because snow for only 63 days in Minneapolis state (over 6 years) does not look real. So, we can remove this variable from analysis to avoid error prone extrapolating.
Also, we saw that 2017 and 2018 had more than 8760 records. By analysing data, we found some duplicate records. Hence, we will create final dataframe after removing those duplicates.
We see from the Spectral densities plot that there exist seasonal trend in the data. Also, the ACF’s are damping exponentially.
We see seasonality of 24, 12, 7 in the data. So, data is not independent of time. Daily mean, weekly mean can vary over time based on month of the year. Hence, this model is non-stationary.
However, since the realization is really clumsy to check for any seasonality in the data, let’s check by picking less records from different year’s dataset.
## [1] "Realization of hourly data in chunks of 20 days are below for different years"
From the above plots, we can see that values are almost getting repeated every 24 hours. Also, we see that after every 5 days, traffic is little less for 2 days. This is another trend. Based on our domain knowledge when we know people travel more during weekdays for work (mostly 1 person per vehicle where as weekend trips are mostly in groups)
aic5.wge(traffic_volume_final$traffic_volume, type = "aic")
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of aic
## p q aic
## 17 5 1 13.04232
## 11 3 1 13.04902
## 9 2 2 13.05508
## 16 5 0 13.06028
## 13 4 0 13.06812
aic5.wge(traffic_volume_final$traffic_volume, type = "bic")
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of bic
## p q bic
## 17 5 1 13.04429
## 11 3 1 13.05042
## 9 2 2 13.05649
## 16 5 0 13.06197
## 8 2 1 13.06944
AIC estimate picked ARMA(5,1) as best model. Here we are estimating next 72 observation which is forecast for last 3 days.
##
## Coefficients of Original polynomial:
## 2.1042 -1.4630 0.2731 0.1211 -0.0808
##
## Factor Roots Abs Recip System Freq
## 1-1.7940B+0.8479B^2 1.0580+-0.2452i 0.9208 0.0362
## 1-0.6503B+0.2800B^2 1.1615+-1.4909i 0.5291 0.1447
## 1+0.3402B -2.9395 0.3402 0.5000
##
##
## [1] "ASE for above ARMA(5,1) model is : 2475821.2232016"
##
## Coefficients of Original polynomial:
## 1.2799 -0.3668 -0.0670 0.0207 0.0479 -0.0816 -0.0188 0.0296 0.0484 -0.0397 -0.0644 -0.0337 0.0615 0.0446 -0.0257 -0.0462 -0.0005 0.0042 -0.0045 -0.0140 0.0074 0.0169 0.0437 0.0767 -0.0874 -0.0076
##
## Factor Roots Abs Recip System Freq
## 1-1.8808B+0.9414B^2 0.9990+-0.2535i 0.9702 0.0396
## 1-1.3638B+0.9163B^2 0.7442+-0.7331i 0.9572 0.1238
## 1-1.6740B+0.9123B^2 0.9175+-0.5043i 0.9551 0.0800
## 1-0.5050B+0.9069B^2 0.2784+-1.0125i 0.9523 0.2073
## 1-0.9260B+0.8272B^2 0.5598+-0.9464i 0.9095 0.1650
## 1+0.4486B+0.8063B^2 -0.2782+-1.0783i 0.8980 0.2902
## 1+0.8944B+0.7956B^2 -0.5621+-0.9700i 0.8920 0.3336
## 1+1.2568B+0.7922B^2 -0.7932+-0.7957i 0.8901 0.3748
## 1+1.5369B+0.7872B^2 -0.9762+-0.5633i 0.8872 0.4167
## 1-0.0119B+0.7844B^2 0.0076+-1.1291i 0.8857 0.2489
## 1+1.7068B+0.7808B^2 -1.0929+-0.2936i 0.8836 0.4582
## 1+0.8734B -1.1449 0.8734 0.5000
## 1-1.7163B+0.7391B^2 1.1611+-0.0700i 0.8597 0.0096
## 1+0.0811B -12.3326 0.0811 0.5000
##
##
##
## Coefficients of Original polynomial:
## 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 1.0000
##
## Factor Roots Abs Recip System Freq
## 1-1.4142B+1.0000B^2 0.7071+-0.7071i 1.0000 0.1250
## 1-0.5176B+1.0000B^2 0.2588+-0.9659i 1.0000 0.2083
## 1-1.9319B+1.0000B^2 0.9659+-0.2588i 1.0000 0.0417
## 1+1.0000B+1.0000B^2 -0.5000+-0.8660i 1.0000 0.3333
## 1-1.0000B+1.0000B^2 0.5000+-0.8660i 1.0000 0.1667
## 1-1.7321B+1.0000B^2 0.8660+-0.5000i 1.0000 0.0833
## 1-1.0000B 1.0000 1.0000 0.0000
## 1+0.5176B+1.0000B^2 -0.2588+-0.9659i 1.0000 0.2917
## 1+1.4142B+1.0000B^2 -0.7071+-0.7071i 1.0000 0.3750
## 1+0.0000B+1.0000B^2 0.0000+-1.0000i 1.0000 0.2500
## 1+1.7321B+1.0000B^2 -0.8660+-0.5000i 1.0000 0.4167
## 1+1.9319B+1.0000B^2 -0.9659+-0.2588i 1.0000 0.4583
## 1+1.0000B -1.0000 1.0000 0.5000
##
##
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of aic
## p q aic
## 17 5 1 13.50918
## 11 3 1 13.51044
## 16 5 0 13.54437
## 9 2 2 13.54512
## 8 2 1 13.54600
##
## Coefficients of Original polynomial:
## 2.1939 -1.4928 0.2302 0.0858 -0.0273
##
## Factor Roots Abs Recip System Freq
## 1-1.8039B+0.8207B^2 1.0990+-0.1032i 0.9059 0.0149
## 1-0.6432B+0.1314B^2 2.4472+-1.2733i 0.3625 0.0764
## 1+0.2531B -3.9503 0.2531 0.5000
##
##
## [1] "ASE for above seasonal model is : 2888406.40003984"
##
## Call:
## arima(x = traffic_volume_final$traffic_volume, order = c(phi$p, 0, phi$q), xreg = traffic_volume_final[,
## c(2:4)])
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 intercept temp
## 2.1062 -1.4667 0.2751 0.1207 -0.0803 -0.7883 761.8147 8.8332
## s.e. 0.0117 0.0191 0.0168 0.0138 0.0062 0.0103 434.9775 1.5402
## rain_1h clouds_all
## 0.0302 0.0600
## s.e. 0.0403 0.1419
##
## sigma^2 estimated as 460881: log likelihood = -234102, aic = 468226
## Obs -0.001323695 0.001176312 0.0008094952 -0.02729275 0.03655779
## [1] 1.405653e-12
## Obs -0.001323695 0.001176312 0.0008094952 -0.02729275 0.03655779 -0.01968347 -0.04596793 -0.002236276 0.08202109 0.07318011 -0.01350892 -0.08307697 -0.02460241 0.06639319 0.06907025 -0.001318474 -0.0373869 -0.02511873 -0.007100658 -0.03238815 -0.04974949 -0.02745047 0.052335 0.165472 0.09035572 0.01027199 -0.004995573 -0.003888772 0.01449666 -0.002955487 -0.02841959 0.002375802 0.02396055 0.01382091 -0.01119042 -0.04524336 -0.02439875 0.008439027 0.02538246 0.004957913 -0.01954381 -0.01273011 0.002663449 -0.01552083 -0.008602525 -0.009270969 0.0323756 0.07157867
## [1] 0
##
## Call:
## arima(x = traffic_volume_final$traffic_volume, order = c(phi$p, 0, phi$q), xreg = cbind(t,
## traffic_volume_final[, c(2:4)]))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 intercept t
## 2.1070 -1.4675 0.2752 0.1207 -0.0802 -0.7892 861.7863 0.0032
## s.e. 0.0116 0.0191 0.0168 0.0138 0.0062 0.0103 433.4392 0.0022
## temp rain_1h clouds_all
## 8.3144 0.0289 0.0505
## s.e. 1.5455 0.0403 0.1419
##
## sigma^2 estimated as 460846: log likelihood = -234100.9, aic = 468225.8
## Obs -0.001366506 0.001003082 0.0006651114 -0.0272827 0.03659667
## [1] 1.369571e-12
## Obs -0.001366506 0.001003082 0.0006651114 -0.0272827 0.03659667 -0.01968326 -0.04602649 -0.002273615 0.08203495 0.07323695 -0.01345809 -0.08304797 -0.02453372 0.06649165 0.06914978 -0.001277692 -0.03737148 -0.0250877 -0.007073392 -0.03239088 -0.04981278 -0.0275641 0.05222153 0.1653656 0.09022845 0.01012151 -0.005152687 -0.003999976 0.01439389 -0.003057248 -0.02854696 0.002249962 0.02385174 0.01373148 -0.01127567 -0.04533555 -0.02446055 0.008376651 0.02533684 0.004902983 -0.01960978 -0.01278146 0.002619512 -0.01556889 -0.008672358 -0.009367844 0.03227591 0.07146556
## [1] 0
##
## Call:
## arima(x = traffic_volume_final$traffic_volume, order = c(phi$p, 0, phi$q), xreg = cbind(traffic_volume_final[,
## 2]))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ma1 intercept
## 2.1060 -1.4664 0.2750 0.1208 -0.0804 -0.7881 905.4582
## s.e. 0.0117 0.0191 0.0168 0.0138 0.0062 0.0103 433.1713
## cbind(traffic_volume_final[, 2])
## 8.3345
## s.e. 1.5347
##
## sigma^2 estimated as 460890: log likelihood = -234102.3, aic = 468222.6
## Obs -0.001299314 0.001212038 0.0008512073 -0.02734207 0.03652034
## [1] 1.404654e-12
## Obs -0.001299314 0.001212038 0.0008512073 -0.02734207 0.03652034 -0.01974294 -0.04597659 -0.002256005 0.08198994 0.07315742 -0.01347671 -0.08308763 -0.02462663 0.06637738 0.06898944 -0.001311623 -0.03742588 -0.02517311 -0.007201507 -0.03243947 -0.04980159 -0.027534 0.05229091 0.1653855 0.09026238 0.01033078 -0.0051034 -0.003889892 0.01443744 -0.002996019 -0.02844975 0.002346801 0.02389577 0.01381294 -0.01123498 -0.04526102 -0.02440017 0.008343268 0.02539306 0.004958378 -0.01953524 -0.01273768 0.002605641 -0.01549533 -0.008656762 -0.009268199 0.03235983 0.07151161
## [1] 0
From Ljung Box Test and from ACF of residuals (for all combinations of variables we tried), it is clear that the residuals are not white noise. Hence we decided not to use this model.
So, now let’s analyse NN model.
## [1] "ASE for above NN model is : 67122186.7034752"
Let’s analyse VAR model using the small dataset we created for NN model.
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 5 4 2 5
##
## $criteria
## 1 2 3 4 5 6 7
## AIC(n) 4.799266 4.103615 4.036316 4.001996 3.981915 3.989432 4.013313
## HQ(n) 5.066322 4.400344 4.362718 4.358071 4.367663 4.404853 4.458407
## SC(n) 5.479957 4.859938 4.868271 4.909583 4.965135 5.048284 5.147798
## FPE(n) 121.459128 60.584724 56.649635 54.747734 53.670114 54.087769 55.410058
## 8 9 10
## AIC(n) 4.027993 4.044769 4.015164
## HQ(n) 4.502760 4.549209 4.549276
## SC(n) 5.238110 5.330518 5.376545
## FPE(n) 56.246924 57.218630 55.571506
Finally since we have seen weekdays are mostly contributing to traffic volume, for simplicity, we have decided to use the data set that belongs to weekdays. Because we have seen that data our model with seasonality factor of 24 is the best model among others.
So, even though our model is not super helpful for weeknds but it is definitely going to be a huge help for weekdays forecast.
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of aic
## p q aic
## 11 3 1 13.68037
## 9 2 2 13.71840
## 16 5 0 13.72819
## 8 2 1 13.72971
## 10 3 0 13.73005
##
## Coefficients of Original polynomial:
## 0.6343 0.3927 -0.2151 -0.0382 -0.0060
##
## Factor Roots Abs Recip System Freq
## 1-1.3631B+0.4884B^2 1.3955+-0.3164i 0.6989 0.0355
## 1+0.5693B -1.7566 0.5693 0.5000
## 1+0.1596B+0.0215B^2 -3.7068+-5.7198i 0.1467 0.3415
##
##
## [1] "ASE for above seasonal model is : 4727624.14434723"
## [1] "ASE for above seasonal model is : 8002835.51148372"